Problem Statement¶

PROJECTO 1
Análisis exploratorio y modelado predictivo de precios de viviendas en Barcelona usando Python y SQL

Objective¶

Desarrollar un análisis completo y un modelo predictivo para los precios de viviendas en Barcelona, utilizando datos extraídos del portal Fotocasa. El objetivo es aplicar técnicas de extracción, manipulación y análisis de datos, así como algoritmos de Machine Learning, para predecir los precios de las viviendas en función de diversas características.

Data Description¶

  • price: The price of the real-state.
  • rooms: Number of rooms.
  • bathroom: Number of bathrooms.
  • lift: whether a building has an elevator (also known as a lift in some regions) or not
  • terrace: If it has a terrace or not.
  • square_meters: Number of square meters.
  • real_state: Kind of real-state.
  • neighborhood: Neighborhood
  • square_meters_price: Price of the square meter

Importing necessary libraries¶

In [288]:
import pandas as pd
import numpy as np

# To help with data visualization
import matplotlib.pyplot as plt # data visualization
import seaborn as sns # data visualization
%matplotlib inline
sns.set_style('whitegrid') # set style for visualization

# To supress warnings
import warnings # ignore warnings
warnings.filterwarnings('ignore')

from scipy.stats import zscore

#normalizing
from sklearn.preprocessing import MinMaxScaler, StandardScaler # to scale the data

# modeling
import statsmodels.api as sm # adding a constant to the independent variables
from sklearn.model_selection import train_test_split # splitting data in train and test sets

from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# for validation
from sklearn.model_selection import cross_val_score, KFold

# Deploy
import joblib

Loading the Dataset¶

In [289]:
df=pd.read_csv('Barcelona_Fotocasa_HousingPrices.csv')

Data Overview¶

In [290]:
df.head() # preview a sample first 5 rows
Out[290]:
Unnamed: 0 price rooms bathroom lift terrace square_meters real_state neighborhood square_meters_price
0 0 750 3 1 True False 60 flat Horta- Guinardo 12.500000
1 1 770 2 1 True False 59 flat Sant Andreu 13.050847
2 2 1300 1 1 True True 30 flat Gràcia 43.333333
3 3 2800 1 1 True True 70 flat Ciutat Vella 40.000000
4 4 720 2 1 True False 44 flat Sant Andreu 16.363636
In [291]:
df.tail() # preview a sample last 5 rows
Out[291]:
Unnamed: 0 price rooms bathroom lift terrace square_meters real_state neighborhood square_meters_price
8183 8183 1075 2 2 False False 65 flat Gràcia 16.538462
8184 8184 1500 3 2 True False 110 flat Eixample 13.636364
8185 8185 1500 2 2 True True 90 flat Sarria-Sant Gervasi 16.666667
8186 8186 1500 3 2 True False 110 flat Eixample 13.636364
8187 8187 1500 3 2 True False 110 flat Eixample 13.636364
In [292]:
df.sample(20) # preview a sample random n rows
Out[292]:
Unnamed: 0 price rooms bathroom lift terrace square_meters real_state neighborhood square_meters_price
64 64 1290 2 1 True False 70 flat Les Corts 18.428571
5695 5695 821 3 1 True False 70 flat Sant Andreu 11.728571
2514 2514 826 2 1 True False 58 flat Eixample 14.241379
6274 6274 1195 2 1 True True 70 attic Eixample 17.071429
1462 1462 2250 3 2 True False 170 flat Sarria-Sant Gervasi 13.235294
7803 7803 750 0 1 True True 43 flat Sarria-Sant Gervasi 17.441860
6078 6078 736 1 1 True False 46 flat Eixample 16.000000
7936 7936 1000 1 1 True False 95 flat Eixample 10.526316
3911 3911 1200 2 1 True False 70 flat Sant Martí 17.142857
7016 7016 800 2 1 True False 57 apartment Sants-Montjuïc 14.035088
5061 5061 900 2 1 False False 60 flat Eixample 15.000000
1689 1689 960 2 2 True False 65 flat Ciutat Vella 14.769231
3045 3045 1100 4 1 False False 80 flat Eixample 13.750000
1819 1819 750 2 1 False False 60 flat Gràcia 12.500000
2643 2643 2500 3 2 True False 185 flat Sarria-Sant Gervasi 13.513514
4891 4891 950 3 1 True False 70 flat Sarria-Sant Gervasi 13.571429
5415 5415 900 2 1 True False 70 flat Eixample 12.857143
3761 3761 975 3 1 True False 60 flat Eixample 16.250000
2306 2306 1200 2 1 True True 80 flat Eixample 15.000000
6678 6678 1135 3 1 True False 93 flat Eixample 12.204301
  • The variable 'Unnamed' represent index and should be deleted from data
  • Target variable for modeling is "price"
In [293]:
print("There are", df.shape[0], 'rows and', df.shape[1], "columns.") # number of observations and features
There are 8188 rows and 10 columns.
  • There are 8188 rows and 10 columns.
In [294]:
df.dtypes # data types
Out[294]:
Unnamed: 0               int64
price                    int64
rooms                    int64
bathroom                 int64
lift                      bool
terrace                   bool
square_meters            int64
real_state              object
neighborhood            object
square_meters_price    float64
dtype: object
  • Data types are aligned with information
In [295]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8188 entries, 0 to 8187
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Unnamed: 0           8188 non-null   int64  
 1   price                8188 non-null   int64  
 2   rooms                8188 non-null   int64  
 3   bathroom             8188 non-null   int64  
 4   lift                 8188 non-null   bool   
 5   terrace              8188 non-null   bool   
 6   square_meters        8188 non-null   int64  
 7   real_state           7920 non-null   object 
 8   neighborhood         8188 non-null   object 
 9   square_meters_price  8188 non-null   float64
dtypes: bool(2), float64(1), int64(5), object(2)
memory usage: 527.9+ KB
  • There is missing data (NaN) on variable 'real_state'. To be replaced by "unknown"
In [296]:
df.describe(include="all").T # statistical summary of the data.
Out[296]:
count unique top freq mean std min 25% 50% 75% max
Unnamed: 0 8188.0 NaN NaN NaN 4093.5 2363.816335 0.0 2046.75 4093.5 6140.25 8187.0
price 8188.0 NaN NaN NaN 1444.092574 1125.886215 320.0 875.0 1100.0 1540.0 15000.0
rooms 8188.0 NaN NaN NaN 2.420738 1.138592 0.0 2.0 2.0 3.0 10.0
bathroom 8188.0 NaN NaN NaN 1.508793 0.732798 1.0 1.0 1.0 2.0 8.0
lift 8188 2 True 5710 NaN NaN NaN NaN NaN NaN NaN
terrace 8188 2 False 6518 NaN NaN NaN NaN NaN NaN NaN
square_meters 8188.0 NaN NaN NaN 84.610161 47.874028 10.0 56.0 73.0 95.0 679.0
real_state 7920 4 flat 6505 NaN NaN NaN NaN NaN NaN NaN
neighborhood 8188 10 Eixample 2401 NaN NaN NaN NaN NaN NaN NaN
square_meters_price 8188.0 NaN NaN NaN 17.739121 9.245241 4.910714 12.790698 15.306122 19.444444 186.666667
  • Units size goes from 10m2 to 679m2, with a mean of 84.61m2
  • Units prices goes from 320EUR to 15000EUR/month, with mean of 1444EUR/month
  • price range is assumed referred to monthly rent, so considered as EUR per month
  • Units prices by square meter goes from 4.9EUR/m2/month to 186EUR/m2/month, with mean of 17.7EUR/m2/month
  • There are units listed with cero rooms
In [297]:
# Uniques
df.nunique() # Checking for number of variations in the data
Out[297]:
Unnamed: 0             8188
price                   889
rooms                    10
bathroom                  8
lift                      2
terrace                   2
square_meters           278
real_state                4
neighborhood             10
square_meters_price    2921
dtype: int64
In [298]:
df.columns
Out[298]:
Index(['Unnamed: 0', 'price', 'rooms', 'bathroom', 'lift', 'terrace',
       'square_meters', 'real_state', 'neighborhood', 'square_meters_price'],
      dtype='object')
In [299]:
for i in ['rooms', 'bathroom', 'lift', 'terrace', 'real_state', 'neighborhood']: # Checking uniques
    print (i,": ",df[i].unique())
rooms :  [ 3  2  1  4  5  0  7  6 10  9]
bathroom :  [1 2 3 4 7 5 6 8]
lift :  [ True False]
terrace :  [False  True]
real_state :  ['flat' 'attic' nan 'apartment' 'study']
neighborhood :  ['Horta- Guinardo' 'Sant Andreu' 'Gràcia' 'Ciutat Vella'
 'Sarria-Sant Gervasi' 'Les Corts' 'Sant Martí' 'Eixample'
 'Sants-Montjuïc' 'Nou Barris']
In [301]:
# Uniques
cat_cols = df.select_dtypes(include=['category', 'object','bool']).columns.tolist()
for column in cat_cols:
    print(df[column].value_counts())
    print("-" * 50)
lift
True     5710
False    2478
Name: count, dtype: int64
--------------------------------------------------
terrace
False    6518
True     1670
Name: count, dtype: int64
--------------------------------------------------
real_state
flat         6505
apartment     991
attic         315
study         109
Name: count, dtype: int64
--------------------------------------------------
neighborhood
Eixample               2401
Sarria-Sant Gervasi    1396
Ciutat Vella           1365
Gràcia                  688
Sant Martí              617
Sants-Montjuïc          590
Les Corts               524
Horta- Guinardo         315
Sant Andreu             178
Nou Barris              114
Name: count, dtype: int64
--------------------------------------------------
  • There are four types of real states being the most common "flat"
  • Most units do not have terrace
  • Most units do have lift
  • The neighborhood with largest unit count is "Eixample"
In [302]:
# Duplicates
df.duplicated().sum() # Checking for duplicate entries in the data
Out[302]:
np.int64(0)
  • There are no duplicated observations

Consolidated notes on Data Overview¶

  • The variable 'Unnamed 0:' represent index and should be deleted from data
  • Target variable for modeling is "price"
  • There are 8188 rows and 10 columns.
  • Data types are aligned with information
  • There is missing data (NaN) on variable 'real_state'. To be replaced by "unknown"
  • Units size goes from 10m2 to 679m2, with a mean of 84.61m2
  • Units prices goes from 320EUR to 15000EUR/month, with mean of 1444EUR/month
  • price range is assumed referred to monthly rent, so considered as EUR per month
  • Units prices by square meter goes from 4.9EUR/m2/month to 186EUR/m2/month, with mean of 17.7EUR/m2/month
  • There are units listed with cero rooms
  • There are four types of real states being the most common "flat"
  • Most units do not have terrace
  • Most units do have lift
  • The neighborhood with largest unit count is "Eixample"
  • There are no duplicated observations

Exploratory Data Analysis (EDA)¶

EDA Functions¶

In [303]:
def univariate_numerical(data):
    '''
    Function to generate two plots for each numerical variable
    Histplot for variable distribution
    Boxplot for statistical summary 
    '''
    # Select numerical columns
    numerical_cols = data.select_dtypes(include=[np.number]).columns
    
    # Determine the number of rows and columns
    num_vars = len(numerical_cols)
    num_cols = 4
    num_rows = int(np.ceil(num_vars * 2 / num_cols))
    
    # Create a figure with the specified size
    fig, axes = plt.subplots(num_rows, num_cols, figsize=(5*num_cols, num_rows * 5))
    
    # Flatten the axes array for easy iteration
    axes = axes.flatten()
    
    # Plot each variable with a histplot and a boxplot
    for i, col in enumerate(numerical_cols):
        mean_value = data[col].mean()
        
        # Histplot with KDE
        sns.histplot(data[col], kde=True, ax=axes[i*2])
        axes[i*2].axvline(mean_value, color='r', linestyle='--')
        axes[i*2].set_title(f'Distribution of {col}')
        axes[i*2].text(mean_value, axes[i*2].get_ylim()[1]*0.8, f'Mean: {mean_value:.2f}', color='r', va='baseline', ha='left',rotation=90)
        
        # Boxplot
        sns.boxplot(y=data[col], ax=axes[i*2 + 1])
        axes[i*2 + 1].axhline(mean_value, color='r', linestyle='--')
        axes[i*2 + 1].set_title(f'Boxplot of {col}')
        axes[i*2 + 1].text(axes[i*2 + 1].get_xlim()[1]*0.8, mean_value, f'mean: {mean_value:.2f}', color='r', va='baseline', ha='right')
    
    # Hide any remaining empty subplots
    for j in range(num_vars * 2, len(axes)):
        fig.delaxes(axes[j])
    
    # Adjust layout
    plt.tight_layout()
    plt.show()
In [304]:
def univariate_categorical(data):
    '''
    Function to generate countplot for each categorical variable
    Labeled with count and percentage
    '''
    # List of categorical columns
    categorical_columns = data.select_dtypes(include=['object', 'category']).columns.tolist()
    
    # Number of columns in the grid
    num_cols = 4
    
    # Calculate the number of rows needed
    num_rows = (len(categorical_columns) + num_cols - 1) // num_cols
    
    # Create the grid
    fig, axes = plt.subplots(num_rows, num_cols, figsize=(5*num_cols, num_rows * 5), constrained_layout=True)
    axes = axes.flatten()
    
    # Plot each countplot in the grid
    for i, col in enumerate(categorical_columns):
        ax = axes[i]
        plot = sns.countplot(x=col, data=data, order=data[col].value_counts().index, ax=ax)
        ax.set_title(f'Count of {col}')
           
        # Add total count and percentage annotations
        total = len(data)
        for p in plot.patches:
            height = p.get_height()
            percentage = f'{(height / total * 100):.1f}%'
            plot.text(x=p.get_x() + p.get_width() / 2,
                      y=height + 2,
                      s=f'{height:.0f}\n({percentage})',
                      ha='center')
        
        # Limit x-axis labels to avoid overlap
        ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
    
    # Remove any empty subplots
    for j in range(i + 1, len(axes)):
        fig.delaxes(axes[j])
    
    # Show the plot
    plt.show()
In [305]:
# Function to plot crosstab with labels
def plot_crosstab_bar_count(df, var_interest):
    '''
    Function to create a barplot of crosstab of the variable of interest vs each of the rest of categorical variables
    Labeled with counts
    '''
    # Extract categorical columns excluding the variable of interest
    cat_cols = df.select_dtypes(include=['category', 'object','bool']).columns.tolist()
    cat_cols.remove(var_interest)
    
    # Determine the grid size
    num_vars = len(cat_cols)
    num_cols = 3  # Number of columns in the grid
    num_rows = (num_vars // num_cols) + int(num_vars % num_cols > 0)

    # Create a grid of subplots
    fig, axes = plt.subplots(num_rows, num_cols, figsize=(5*num_cols, num_rows * 5), constrained_layout=True)
    axes = axes.flatten()  # Flatten the axes array for easy iteration

    for i, col in enumerate(cat_cols):
        # Create a crosstab
        crosstab = pd.crosstab(df[col], df[var_interest])
        
        # Plot the crosstab as a bar plot
        crosstab.plot(kind='bar', stacked=True, ax=axes[i])
        
        # Annotate counts in the middle of each bar section
        for bar in axes[i].patches:
            height = bar.get_height()
            if height > 0:
                axes[i].annotate(f'{int(height)}', 
                                 (bar.get_x() + bar.get_width() / 2, bar.get_y() + height / 2),
                                 ha='center', va='center', fontsize=10, color='black')
        
        # Add total labels at the top of each bar
        totals = crosstab.sum(axis=1)
        for j, total in enumerate(totals):
            axes[i].annotate(f'Total: {total}', 
                             (j, totals[j]), 
                             ha='center', va='bottom', weight='bold')

    # Hide any remaining empty subplots
    for j in range(i + 1, len(axes)):
        fig.delaxes(axes[j])

    plt.tight_layout()
    plt.show()

# Usage
#plot_crosstab_bar_count(df, var_interest='var_interest')
In [306]:
def plot_crosstab_heat_perc(df, var_interest,df_name="DataFrame"):
    '''
    Function to create a heatmap of crosstab of the variable of interest vs each of the rest of catagorical variables
    Labeled with counts, percentage by row, percentage by column
    '''
    # Extract categorical columns excluding the variable of interest
    cat_cols = df.select_dtypes(include=['category', 'object']).columns.tolist()
    cat_cols.remove(var_interest)
    
    # Determine the grid size
    num_vars = len(cat_cols)
    num_cols = 3  # Number of columns in the grid
    num_rows = (num_vars // num_cols) + int(num_vars % num_cols > 0)

    # Create a grid of subplots
    fig, axes = plt.subplots(num_rows, num_cols, figsize=(6*num_cols, num_rows * 6))
    axes = axes.flatten()  # Flatten the axes array for easy iteration

    for i, col in enumerate(cat_cols):
        # Create crosstabs
        crosstab = pd.crosstab(df[col], df[var_interest])
        crosstab_perc_row = crosstab.div(crosstab.sum(axis=1), axis=0) * 100
        crosstab_perc_col = crosstab.div(crosstab.sum(axis=0), axis=1) * 100

        # Combine counts with percentages
        crosstab_combined = crosstab.astype(str) + "\n" + \
                            crosstab_perc_row.round(2).astype(str) + "%" + "\n" + \
                            crosstab_perc_col.round(2).astype(str) + "%"

        # Plot the crosstab as a heatmap
        sns.heatmap(crosstab, annot=crosstab_combined, fmt='', cmap='Blues', ax=axes[i], cbar=False, annot_kws={"size": 8})
        axes[i].set_title(f'Crosstab of {col} and {var_interest} - {df_name}', fontsize=12)

    # Hide any remaining empty subplots
    for j in range(i + 1, len(axes)):
        fig.delaxes(axes[j])

    # Adjust layout to prevent label overlapping
    plt.subplots_adjust(hspace=0.4, wspace=0.4)  # Add more space between subplots
    plt.tight_layout()
    plt.show()
    
# Usage
#plot_crosstab_heat_perc(df, var_interest='var_interest')
In [307]:
def boxplot_by_group(df, group, var, outliers, df_name="DataFrame"):
    '''
    boxplot for a numerical variable of interest vs a categorical variable
    with or without outliers
    includes data mean and mean by category
    '''
    # Calculate the average for the variable
    var_avg = df[var].mean()
    
    # Calculate variable mean per group
    var_means = df.groupby(group)[var].mean()
    
    # Sort by means and get the sorted order
    var_sorted = var_means.sort_values(ascending=False).index
    
    # Reorder the DataFrame by the sorted group
    df[group] = pd.Categorical(df[group], categories=var_sorted, ordered=True)
    
    # Create the boxplot with the reordered sectors
    ax = sns.boxplot(data=df, x=group, y=var, order=var_sorted, showfliers=outliers)
    
    # Add horizontal line for average variable value
    plt.axhline(var_avg, color='red', linestyle='--', label=f'Avg {var}: {var_avg:.2f}')
    
    # Scatter plot for means
    x_positions = range(len(var_means.sort_values(ascending=False)))
    plt.scatter(x=x_positions, y=var_means.sort_values(ascending=False), color='red', label='Mean', zorder=5)
    
    # Add labels to each red dot with the mean value
    for i, mean in enumerate(var_means.sort_values(ascending=False)):
        plt.text(i, mean, f'{mean:.2f}', color='red', ha='center', va='bottom')
    
    # Rotate x-axis labels
    plt.xticks(ticks=x_positions, labels=var_means.sort_values(ascending=False).index, rotation=90)
    
    # Add a legend
    plt.legend()
    plt.xlabel('')  # Remove x-axis title
    
    # Add plot title with DataFrame name
    plt.title(f'Boxplot of {var} by {group} - {df_name}')
    
    # Adjust layout
    plt.tight_layout()
    
    # Display the plot
    #plt.show()

Functions:

  • univariate_numerical(data): Function to generate two plots for each numerical variable. Histplot for variable distribution. Boxplot for statistical summary
  • univariate_categorical(data): Function to generate countplot for each categorical variable. Labeled with count and percentage
  • plot_crosstab_bar_count(df, var_interest): Function to create a barplot of crosstab of the variable of interest vs each of the rest of categorical variables. Labeled with counts
  • plot_crosstab_heat_perc(df, var_interest): Function to create a heatmap of crosstab of the variable of interest vs each of the rest of catagorical variables. Labeled with counts, percentage by row, percentage by column
  • boxplot_by_group(df, group, var, outliers): boxplot for a numerical variable of interest vs a categorical variable. with or without outliers. includes data mean and mean by category

Univariate Analysis¶

In [308]:
univariate_numerical(df)
No description has been provided for this image
  • The variables "Unnamed: 0" have a uniform distribution
  • The numerical variables have a shift to the right
In [309]:
univariate_categorical(df);
No description has been provided for this image
In [310]:
df.loc[(df['real_state']=="flat")].describe().T
Out[310]:
count mean std min 25% 50% 75% max
Unnamed: 0 6505.0 4022.892083 2319.413714 0.000000 2059.0 3989.0 5943.000000 8187.0
price 6505.0 1316.758032 913.617019 320.000000 865.0 1050.0 1380.000000 15000.0
rooms 6505.0 2.541430 1.091669 0.000000 2.0 3.0 3.000000 10.0
bathroom 6505.0 1.510530 0.720411 1.000000 1.0 1.0 2.000000 8.0
square_meters 6505.0 85.514527 45.535769 11.000000 60.0 75.0 95.000000 679.0
square_meters_price 6505.0 15.755932 5.405202 5.555556 12.5 14.5 17.692308 100.0
In [311]:
df.loc[(df['neighborhood']=="Eixample")].describe().T
Out[311]:
count mean std min 25% 50% 75% max
Unnamed: 0 2401.0 4202.712620 2371.995010 12.000000 2057.0 4236.000000 6284.00 8187.0
price 2401.0 1592.147022 1150.448403 490.000000 1000.0 1225.000000 1700.00 15000.0
rooms 2401.0 2.563099 1.098994 0.000000 2.0 3.000000 3.00 9.0
bathroom 2401.0 1.583923 0.681096 1.000000 1.0 2.000000 2.00 8.0
square_meters 2401.0 87.946689 40.787208 20.000000 64.0 80.000000 100.00 350.0
square_meters_price 2401.0 19.018949 11.321671 6.741573 13.0 15.714286 20.63 144.5
  • The categorical variables are not balanced, with 79% of properties as "apartments" and 78% of units concentrated in 50% of the sample neighbourhoods
  • 75% of flats units have up to 3 bedrooms and up to 2 bathrooms with an average size of 85m2.
  • 75% of the units in Eixample have up to 3 bedrooms and up to 2 bathrooms with an average size of 87m2.

Bivariate Analysis¶

In [312]:
# Create a PairGrid
g = sns.PairGrid(df, corner=True)

# Map different plots to the grid
g.map_lower(sns.scatterplot)
g.map_diag(sns.histplot,kde=True)

# Show the plot
plt.show()
No description has been provided for this image
In [313]:
# Calculate correlation matrix
corr_matrix = df.select_dtypes(include=np.number).corr()
In [314]:
# Plot correlation matrix as heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix Heatmap')
plt.show()
No description has been provided for this image
In [315]:
# Display the sorted correlation table
corr_unstacked = corr_matrix.unstack() # Unstack the correlation matrix
corr_unstacked = corr_unstacked.reset_index() # Reset the index to get 'variable1' and 'variable2' as columns
corr_unstacked.columns = ['variable1', 'variable2', 'correlation']# Rename the columns for better understanding
corr_unstacked = corr_unstacked[corr_unstacked['variable1'] != corr_unstacked['variable2']] # Remove self-correlations by filtering out rows where variable1 == variable2
corr_unstacked = corr_unstacked.drop_duplicates(subset=['correlation']) # Drop duplicates to keep only one entry per variable pair
sorted_corr = corr_unstacked.sort_values(by='correlation', key=abs, ascending=False) # Sort the DataFrame by the absolute value of correlation
#sorted_corr # Display the sorted correlation table
In [316]:
# Define a function to categorize the correlation level
def categorize_correlation(correlation):
    abs_corr = abs(correlation) * 100  # Convert to percentage for easier comparison
    if abs_corr < 30:
        return 'Negligible'
    elif 30 <= abs_corr < 50:
        return 'Low'
    elif 50 <= abs_corr < 70:
        return 'Moderate'
    elif 70 <= abs_corr < 90:
        return 'High'
    else:
        return 'Very High'
In [317]:
# Apply the function to create the corr_lvl column
sorted_corr['corr_lvl'] = sorted_corr['correlation'].apply(categorize_correlation)
sorted_corr['corr_lvl'].value_counts()
Out[317]:
corr_lvl
Negligible    8
Moderate      5
High          1
Low           1
Name: count, dtype: int64
In [318]:
sorted_corr
Out[318]:
variable1 variable2 correlation corr_lvl
22 bathroom square_meters 0.755291 High
10 price square_meters 0.690398 Moderate
16 rooms square_meters 0.644285 Moderate
9 price bathroom 0.579747 Moderate
15 rooms bathroom 0.572576 Moderate
11 price square_meters_price 0.520627 Moderate
8 price rooms 0.348086 Low
17 rooms square_meters_price -0.223946 Negligible
29 square_meters square_meters_price -0.128385 Negligible
5 Unnamed: 0 square_meters_price 0.069195 Negligible
4 Unnamed: 0 square_meters -0.055379 Negligible
2 Unnamed: 0 rooms -0.042374 Negligible
3 Unnamed: 0 bathroom -0.040113 Negligible
23 bathroom square_meters_price -0.013050 Negligible
1 Unnamed: 0 price -0.010116 Negligible
  • 'square_meters' has a positive correlation with 'price', 'rooms' and 'bathrooms'
  • 'square_meters_price' has a negative correlation with 'square_meters', 'rooms' and 'bathrooms'
  • There are only one couple of variables with high correlation: bathroom-square_meters (0.75)
In [319]:
df.columns
Out[319]:
Index(['Unnamed: 0', 'price', 'rooms', 'bathroom', 'lift', 'terrace',
       'square_meters', 'real_state', 'neighborhood', 'square_meters_price'],
      dtype='object')
In [320]:
# check unique rooms-bathroom combinations
unique_combinations=df.groupby(['rooms', 'bathroom']).size().reset_index(name='count')
unique_combinations_sorted=unique_combinations.sort_values(by='count',ascending=False)
In [321]:
# Calculate cumulative sum of counts
unique_combinations_sorted['cum_sum'] = unique_combinations_sorted['count'].cumsum()
In [322]:
# Calculate the cumulative percentage
unique_combinations_sorted['perc'] = unique_combinations_sorted['count'] / unique_combinations_sorted['count'].sum() * 100
unique_combinations_sorted['cum_perc'] = unique_combinations_sorted['cum_sum'] / unique_combinations_sorted['count'].sum() * 100
unique_combinations_sorted.head(10)
Out[322]:
rooms bathroom count cum_sum perc cum_perc
6 2 1 1836 1836 22.423058 22.423058
4 1 1 1519 3355 18.551539 40.974597
11 3 2 1207 4562 14.741085 55.715681
10 3 1 1123 5685 13.715193 69.430874
7 2 2 735 6420 8.976551 78.407426
17 4 2 641 7061 7.828530 86.235955
16 4 1 191 7252 2.332682 88.568637
0 0 1 188 7440 2.296043 90.864680
18 4 3 173 7613 2.112848 92.977528
12 3 3 111 7724 1.355642 94.333170
  • The most popular unit configuration in the dataset is 2 bedrooms and 1 bathroom with 1836 units (22.42% of all units).
  • Other popular configurations are 1-1 (18.55%), 3-2 (14.74%), 3-1 (13.71%), 2-2 (8.97%) and 4-2 (7.82%)
  • These six most popular unit configurations represent 86.23% of all units
In [323]:
df_pop = df.merge(unique_combinations_sorted.head(10), on=['rooms', 'bathroom'])
df_pop
Out[323]:
Unnamed: 0 price rooms bathroom lift terrace square_meters real_state neighborhood square_meters_price count cum_sum perc cum_perc
0 0 750 3 1 True False 60 flat Horta- Guinardo 12.500000 1123 5685 13.715193 69.430874
1 1 770 2 1 True False 59 flat Sant Andreu 13.050847 1836 1836 22.423058 22.423058
2 2 1300 1 1 True True 30 flat Gràcia 43.333333 1519 3355 18.551539 40.974597
3 3 2800 1 1 True True 70 flat Ciutat Vella 40.000000 1519 3355 18.551539 40.974597
4 4 720 2 1 True False 44 flat Sant Andreu 16.363636 1836 1836 22.423058 22.423058
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7719 8183 1075 2 2 False False 65 flat Gràcia 16.538462 735 6420 8.976551 78.407426
7720 8184 1500 3 2 True False 110 flat Eixample 13.636364 1207 4562 14.741085 55.715681
7721 8185 1500 2 2 True True 90 flat Sarria-Sant Gervasi 16.666667 735 6420 8.976551 78.407426
7722 8186 1500 3 2 True False 110 flat Eixample 13.636364 1207 4562 14.741085 55.715681
7723 8187 1500 3 2 True False 110 flat Eixample 13.636364 1207 4562 14.741085 55.715681

7724 rows × 14 columns

In [324]:
print(df.shape,df_pop.shape)
(8188, 10) (7724, 14)
  • The "df_pop" data frame includes the most popular units in terms of bedroom/bathroom configuration, representing 94% of the samples
In [325]:
# Define the function to create and display side-by-side boxplots
def side_by_side_boxplot(df1, df2, group, var, outliers, title1, title2):
    fig, axes = plt.subplots(1, 2, figsize=(18, 6), sharey=True)
    
    # First subplot for df1
    plt.sca(axes[0])
    boxplot_by_group(df1, group, var, outliers, title1)
    
    # Second subplot for df2
    plt.sca(axes[1])
    boxplot_by_group(df2, group, var, outliers, title2)
    
    # Show both plots after setup
    plt.show()

# Usage
#side_by_side_boxplot(df, df_pop, 'neighborhood', 'price', True, "All units (show outliers)", "Popular units (show outliers)")
In [326]:
side_by_side_boxplot(df, df_pop, 'neighborhood', 'price', True, "All units (show outliers)", "Popular units (show outliers)")
No description has been provided for this image
In [327]:
side_by_side_boxplot(df, df_pop, 'neighborhood', 'price', False, "All units (without outliers)", "Popular units (without outliers)")
No description has been provided for this image
  • Sarrià-Sant Gervasi, Les Corts, Eixample and San Martí are the most expensive neighbourhoods with average prices above the dataset average.
  • Sants-Montjuïc, Horta-Guinardó, Sant Andreu and Nou Barris are the cheapest neighbourhoods with average prices below the dataset average.
In [328]:
side_by_side_boxplot(df, df_pop, 'neighborhood', 'square_meters_price', False, "All units (without outliers)", "Popular units (without outliers)")
No description has been provided for this image
  • When comparing the price per square metre, Ciutat Vella and Eixample are the most expensive neighbourhoods.
  • If we compare square meters, Ciutat Vella is the second lowest and Eixample the third
In [329]:
side_by_side_boxplot(df, df_pop, 'neighborhood', 'square_meters', False, "All units (without outliers)", "Popular units (without outliers)")
No description has been provided for this image
  • From the perspective of price per square meter, the most attractive neighborhood according to this data could be Les Corts, with an average surface area of ​​89.79 m2 above the average (78.67 m2) and a price per square meter of 15.85 below the average (17.79)
In [330]:
side_by_side_boxplot(df, df_pop, 'real_state' , 'price', False, "All units (without outliers)", "Popular units (without outliers)")
No description has been provided for this image
In [331]:
side_by_side_boxplot(df, df_pop, 'real_state' , 'square_meters_price', False, "All units (without outliers)", "Popular units (without outliers)")
No description has been provided for this image
In [332]:
side_by_side_boxplot(df, df_pop, 'real_state' , 'square_meters', False, "All units (without outliers)", "Popular units (without outliers)")
No description has been provided for this image
  • From the perspective of price per square meter, the most attractive type of unit according to this data could be the apartment, with an average surface area of ​​80 m2 above the average (78.67 m2) and a price per square meter of 15.76 below the average (17.79)
In [333]:
plot_crosstab_heat_perc(df, var_interest='real_state',df_name="All units")
No description has been provided for this image
In [334]:
plot_crosstab_heat_perc(df_pop, var_interest='real_state',df_name="Popular units")
No description has been provided for this image
  • There are 1,777 flats in Eixample, being the most popular unit type and neighborhood combination, with 79.68% of the units in Eixample being flats, and 28.9% of the flats are in Eixample. - In Les Courts there are only 398 flats, which makes it far from the most popular type of housing and neighbourhood combination, although 87.67% of the dwellings in Les Courts are flats, only 6.47% of the flats are in Les Courts.
In [335]:
plot_crosstab_bar_count(df, var_interest='lift')
No description has been provided for this image
  • Most types of units have a lift, in the case of flats the proportion is 74.12%
In [336]:
plot_crosstab_bar_count(df, var_interest='terrace')
No description has been provided for this image
  • Units with a terrace on the other hand, seem to be rare and very few have one

Consolidated notes on Exploratory Data Analysis¶

  • univariate_numerical(data): Function to generate two plots for each numerical variable. Histplot for variable distribution. Boxplot for statistical summary
  • univariate_categorical(data): Function to generate countplot for each categorical variable. Labeled with count and percentage
  • plot_crosstab_bar_count(df, var_interest): Function to create a barplot of crosstab of the variable of interest vs each of the rest of categorical variables. Labeled with counts
  • plot_crosstab_heat_perc(df, var_interest): Function to create a heatmap of crosstab of the variable of interest vs each of the rest of catagorical variables. Labeled with counts, percentage by row, percentage by column
  • boxplot_by_group(df, group, var, outliers): boxplot for a numerical variable of interest vs a categorical variable. with or without outliers. includes data mean and mean by category
  • The variables "Unnamed: 0" have a uniform distribution
  • The numerical variables have a shift to the right
  • The categorical variables are not balanced, with 79% of properties as "apartments" and 78% of units concentrated in 50% of the sample neighbourhoods
  • 75% of "flat" units have up to 3 bedrooms and up to 2 bathrooms with an average size of 85m2.
  • 75% of the units in Eixample have up to 3 bedrooms and up to 2 bathrooms with an average size of 87m2.
  • 'square_meters' has a positive correlation with 'price', 'rooms' and 'bathrooms'
  • 'square_meters_price' has a negative correlation with 'square_meters', 'rooms' and 'bathrooms'
  • There are only one couple of variables with high correlation: bathroom-square_meters (0.75)
  • The most popular unit configuration in the dataset is 2 bedrooms and 1 bathroom with 1836 units (22.42% of all units).
  • Other popular configurations are 1-1 (18.55%), 3-2 (14.74%), 3-1 (13.71%), 2-2 (8.97%) and 4-2 (7.82%)
  • These six most popular unit configurations represent 86.23% of all units
  • The "df_pop" data frame includes the most popular units in terms of bedroom/bathroom configuration, representing 94% of the samples
  • Sarrià-Sant Gervasi, Les Corts, Eixample and San Martí are the most expensive neighbourhoods with average prices above the dataset average.
  • Sants-Montjuïc, Horta-Guinardó, Sant Andreu and Nou Barris are the cheapest neighbourhoods with average prices below the dataset average.
  • When comparing the price per square metre, Ciutat Vella and Eixample are the most expensive neighbourhoods. - If we compare square meters, Ciutat Vella is the second lowest and Eixample the third
  • From the perspective of price per square meter, the most attractive neighborhood according to this data could be Les Corts, with an average surface area of ​​89.79 m2 above the average (78.67 m2) and a price per square meter of 15.85 below the average (17.79)
  • From the perspective of price per square meter, the most attractive type of unit according to this data could be the apartment, with an average surface area of ​​80 m2 above the average (78.67 m2) and a price per square meter of 15.76 below the average (17.79)
  • There are 1,777 flats in Eixample, being the most popular unit type and neighborhood combination, with 79.68% of the units in Eixample being flats, and 28.9% of the flats are in Eixample. - In Les Courts there are only 398 flats, which makes it far from the most popular type of housing and neighbourhood combination, although 87.67% of the dwellings in Les Courts are flats, only 6.47% of the flats are in Les Courts.
  • Most types of units have a lift, in the case of flats the proportion is 74.12%
  • Units with a terrace on the other hand, seem to be rare and very few have one

Data Preprocesing¶

  • Missing value treatment
  • Feature engineering
  • Outlier detection and treatment
  • Any other preprocessing steps
In [337]:
df2=df.copy() # Data preprocesing over a copy of original dataset
  • Preprocessed data on a copy of the original dataset named df2
In [338]:
df2.isna().sum() # missing values per feature
Out[338]:
Unnamed: 0               0
price                    0
rooms                    0
bathroom                 0
lift                     0
terrace                  0
square_meters            0
real_state             268
neighborhood             0
square_meters_price      0
dtype: int64
In [339]:
df2['real_state'].value_counts(dropna=False)
Out[339]:
real_state
flat         6505
apartment     991
attic         315
NaN           268
study         109
Name: count, dtype: int64
In [340]:
# Add 'unknown' to categories
df2['real_state'] = pd.Categorical(df2['real_state'])
df2['real_state'] = df2['real_state'].cat.add_categories("unknown")

# Replace NaN values with 'unknown'
df2['real_state'] = df2['real_state'].fillna("unknown")
  • Created a new category "unknown" in the variable 'real_state' replacing NaN
In [341]:
df2.isna().sum() # missing values per feature
Out[341]:
Unnamed: 0             0
price                  0
rooms                  0
bathroom               0
lift                   0
terrace                0
square_meters          0
real_state             0
neighborhood           0
square_meters_price    0
dtype: int64
In [342]:
df2['real_state'].value_counts()
Out[342]:
real_state
flat         6505
apartment     991
attic         315
unknown       268
study         109
Name: count, dtype: int64
In [343]:
df2.head()
Out[343]:
Unnamed: 0 price rooms bathroom lift terrace square_meters real_state neighborhood square_meters_price
0 0 750 3 1 True False 60 flat Horta- Guinardo 12.500000
1 1 770 2 1 True False 59 flat Sant Andreu 13.050847
2 2 1300 1 1 True True 30 flat Gràcia 43.333333
3 3 2800 1 1 True True 70 flat Ciutat Vella 40.000000
4 4 720 2 1 True False 44 flat Sant Andreu 16.363636
In [344]:
df2.drop(['Unnamed: 0'], axis=1, inplace=True)
  • Removed the variable "Unnamed: 0" which had no value for modeling
In [345]:
# function to check for outliers
def count_outliers(df):
    outlier_count=0
    for column in df.select_dtypes(include=np.number).columns:
        outliers=len(df[(df[column] < df[column].quantile(0.25)-1.5*(df[column].quantile(0.75)-df[column].quantile(0.25))) | (df[column] > df[column].quantile(0.75)+1.5*(df[column].quantile(0.75)-df[column].quantile(0.25)))][column])
        print(f'{column}: {outliers} outliers ({outliers/df.shape[0]*100:.2f}%)')
        outlier_count+= outliers
    return outlier_count
In [346]:
count_outliers(df2)
price: 807 outliers (9.86%)
rooms: 474 outliers (5.79%)
bathroom: 174 outliers (2.13%)
square_meters: 598 outliers (7.30%)
square_meters_price: 589 outliers (7.19%)
Out[346]:
2642
In [347]:
df2.shape
Out[347]:
(8188, 9)
  • There are outliers in all numerical variables. df2_shape:(8188, 9)
In [348]:
# Calculate z-scores for only numeric columns without creating dummies
outlier_mask = (np.abs(df2.select_dtypes(include=np.number).apply(zscore)) < 3).all(axis=1)

# Filter the DataFrame based on the outlier mask and retain the original column structure
df3 = df2[outlier_mask]
In [349]:
count_outliers(df3)
price: 672 outliers (8.68%)
rooms: 318 outliers (4.11%)
bathroom: 0 outliers (0.00%)
square_meters: 438 outliers (5.66%)
square_meters_price: 449 outliers (5.80%)
Out[349]:
1877
In [350]:
df3.shape
Out[350]:
(7742, 9)
  • Applied the Z-score method, which removes outliers with more than 3 standard deviations. Some variables with a relevant percentage of outliers still remain. df3_shape:(7742, 9)
In [351]:
df4=df3.copy()
for column in df4.select_dtypes(include=np.number).columns:
    df4[column]=np.clip(df4[column], df4[column].quantile(0.25)-1.5*(df4[column].quantile(0.75)-df4[column].quantile(0.25)), df4[column].quantile(0.75)+1.5*(df4[column].quantile(0.75)-df4[column].quantile(0.25)))
In [352]:
count_outliers(df4)
price: 0 outliers (0.00%)
rooms: 0 outliers (0.00%)
bathroom: 0 outliers (0.00%)
square_meters: 0 outliers (0.00%)
square_meters_price: 0 outliers (0.00%)
Out[352]:
0
In [353]:
df4.shape
Out[353]:
(7742, 9)
  • Limited outliers to respective whisker boundaries. df4_shape:(7742, 9)
In [406]:
#creating dumies
df5 = pd.get_dummies(df4, columns=['real_state','neighborhood'], drop_first=False)
In [407]:
df5.shape
Out[407]:
(7742, 22)
  • Created dummy variables for variables 'real_state' and 'neighborhood' considering drop_first=False. df5_shape:(7742, 22)
In [411]:
df5.info()
<class 'pandas.core.frame.DataFrame'>
Index: 7742 entries, 0 to 8187
Data columns (total 22 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   price                             7742 non-null   int64  
 1   rooms                             7742 non-null   float64
 2   bathroom                          7742 non-null   int64  
 3   lift                              7742 non-null   int64  
 4   terrace                           7742 non-null   int64  
 5   square_meters                     7742 non-null   int64  
 6   square_meters_price               7742 non-null   float64
 7   real_state_apartment              7742 non-null   int64  
 8   real_state_attic                  7742 non-null   int64  
 9   real_state_flat                   7742 non-null   int64  
 10  real_state_study                  7742 non-null   int64  
 11  real_state_unknown                7742 non-null   int64  
 12  neighborhood_Sarria-Sant Gervasi  7742 non-null   int64  
 13  neighborhood_Les Corts            7742 non-null   int64  
 14  neighborhood_Eixample             7742 non-null   int64  
 15  neighborhood_Sant Martí           7742 non-null   int64  
 16  neighborhood_Ciutat Vella         7742 non-null   int64  
 17  neighborhood_Gràcia               7742 non-null   int64  
 18  neighborhood_Sants-Montjuïc       7742 non-null   int64  
 19  neighborhood_Horta- Guinardo      7742 non-null   int64  
 20  neighborhood_Sant Andreu          7742 non-null   int64  
 21  neighborhood_Nou Barris           7742 non-null   int64  
dtypes: float64(2), int64(20)
memory usage: 1.4 MB
In [412]:
# Convert boolean to numeric
cols = df5.select_dtypes(['bool'])
for i in cols.columns:
    df5[i] = df5[i].astype('int')
In [413]:
df5.head()
Out[413]:
price rooms bathroom lift terrace square_meters square_meters_price real_state_apartment real_state_attic real_state_flat ... neighborhood_Sarria-Sant Gervasi neighborhood_Les Corts neighborhood_Eixample neighborhood_Sant Martí neighborhood_Ciutat Vella neighborhood_Gràcia neighborhood_Sants-Montjuïc neighborhood_Horta- Guinardo neighborhood_Sant Andreu neighborhood_Nou Barris
0 750 3.0 1 1 0 60 12.500000 0 0 1 ... 0 0 0 0 0 0 0 1 0 0
1 770 2.0 1 1 0 59 13.050847 0 0 1 ... 0 0 0 0 0 0 0 0 1 0
2 1300 1.0 1 1 1 30 28.384073 0 0 1 ... 0 0 0 0 0 1 0 0 0 0
3 2225 1.0 1 1 1 70 28.384073 0 0 1 ... 0 0 0 0 1 0 0 0 0 0
4 720 2.0 1 1 0 44 16.363636 0 0 1 ... 0 0 0 0 0 0 0 0 1 0

5 rows × 22 columns

  • Boolean variables were converted to numeric

Consolidated notes on Data Preprocesing¶

  • Preprocessed data on a copy of the original dataset named df2
  • Created a new category "unknown" in the variable 'real_state' replacing NaN
  • Removed the variable "Unnamed: 0" which had no value for modeling
  • There are outliers in all numerical variables. df2_shape:(8188, 9)
  • Applied the Z-score method, which removes outliers with more than 3 standard deviations. Some variables with a relevant percentage of outliers still remain. df3_shape:(7742, 9)
  • Limited outliers to respective whisker boundaries. df4_shape:(7742, 9)
  • Created dummy variables for variables 'real_state' and 'neighborhood' considering drop_first=False. df5_shape:(7742, 22)
  • Boolean variables were converted to numeric

EDA (pre-modeling)¶

In [414]:
df5.shape
Out[414]:
(7742, 22)
In [415]:
df5.info()
<class 'pandas.core.frame.DataFrame'>
Index: 7742 entries, 0 to 8187
Data columns (total 22 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   price                             7742 non-null   int64  
 1   rooms                             7742 non-null   float64
 2   bathroom                          7742 non-null   int64  
 3   lift                              7742 non-null   int64  
 4   terrace                           7742 non-null   int64  
 5   square_meters                     7742 non-null   int64  
 6   square_meters_price               7742 non-null   float64
 7   real_state_apartment              7742 non-null   int64  
 8   real_state_attic                  7742 non-null   int64  
 9   real_state_flat                   7742 non-null   int64  
 10  real_state_study                  7742 non-null   int64  
 11  real_state_unknown                7742 non-null   int64  
 12  neighborhood_Sarria-Sant Gervasi  7742 non-null   int64  
 13  neighborhood_Les Corts            7742 non-null   int64  
 14  neighborhood_Eixample             7742 non-null   int64  
 15  neighborhood_Sant Martí           7742 non-null   int64  
 16  neighborhood_Ciutat Vella         7742 non-null   int64  
 17  neighborhood_Gràcia               7742 non-null   int64  
 18  neighborhood_Sants-Montjuïc       7742 non-null   int64  
 19  neighborhood_Horta- Guinardo      7742 non-null   int64  
 20  neighborhood_Sant Andreu          7742 non-null   int64  
 21  neighborhood_Nou Barris           7742 non-null   int64  
dtypes: float64(2), int64(20)
memory usage: 1.4 MB
In [416]:
df5.describe().T
Out[416]:
count mean std min 25% 50% 75% max
price 7742.0 1208.508525 467.396946 320.000000 850.000000 1080.00000 1400.0 2225.000000
rooms 7742.0 2.355464 1.014828 0.500000 2.000000 2.00000 3.0 4.500000
bathroom 7742.0 1.425084 0.573727 1.000000 1.000000 1.00000 2.0 3.000000
lift 7742.0 0.696461 0.459816 0.000000 0.000000 1.00000 1.0 1.000000
terrace 7742.0 0.193878 0.395360 0.000000 0.000000 0.00000 0.0 1.000000
square_meters 7742.0 77.151640 29.488338 10.000000 55.000000 71.00000 91.0 145.000000
square_meters_price 7742.0 16.445057 5.020367 5.555556 12.743952 15.09434 19.0 28.384073
real_state_apartment 7742.0 0.108887 0.311517 0.000000 0.000000 0.00000 0.0 1.000000
real_state_attic 7742.0 0.035133 0.184128 0.000000 0.000000 0.00000 0.0 1.000000
real_state_flat 7742.0 0.810385 0.392022 0.000000 1.000000 1.00000 1.0 1.000000
real_state_study 7742.0 0.014079 0.117825 0.000000 0.000000 0.00000 0.0 1.000000
real_state_unknown 7742.0 0.031516 0.174720 0.000000 0.000000 0.00000 0.0 1.000000
neighborhood_Sarria-Sant Gervasi 7742.0 0.161586 0.368095 0.000000 0.000000 0.00000 0.0 1.000000
neighborhood_Les Corts 7742.0 0.061224 0.239757 0.000000 0.000000 0.00000 0.0 1.000000
neighborhood_Eixample 7742.0 0.289848 0.453721 0.000000 0.000000 0.00000 1.0 1.000000
neighborhood_Sant Martí 7742.0 0.076466 0.265759 0.000000 0.000000 0.00000 0.0 1.000000
neighborhood_Ciutat Vella 7742.0 0.172307 0.377671 0.000000 0.000000 0.00000 0.0 1.000000
neighborhood_Gràcia 7742.0 0.086153 0.280609 0.000000 0.000000 0.00000 0.0 1.000000
neighborhood_Sants-Montjuïc 7742.0 0.074399 0.262437 0.000000 0.000000 0.00000 0.0 1.000000
neighborhood_Horta- Guinardo 7742.0 0.040300 0.196674 0.000000 0.000000 0.00000 0.0 1.000000
neighborhood_Sant Andreu 7742.0 0.022991 0.149886 0.000000 0.000000 0.00000 0.0 1.000000
neighborhood_Nou Barris 7742.0 0.014725 0.120457 0.000000 0.000000 0.00000 0.0 1.000000
  • The shape of the data for modeling (7742, 22) does not account for outliers
  • The data for modeling have no missing values
In [417]:
univariate_numerical(df5)
No description has been provided for this image
  • No outliers in the data
In [418]:
# Calculate correlation matrix
corr_matrix = df5.select_dtypes(include=np.number).corr()
In [419]:
# Plot correlation matrix as heatmap
plt.figure(figsize=(14, 12))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix Heatmap')
plt.show()
No description has been provided for this image
In [420]:
# Display the sorted correlation table
corr_unstacked = corr_matrix.unstack() # Unstack the correlation matrix
corr_unstacked = corr_unstacked.reset_index() # Reset the index to get 'variable1' and 'variable2' as columns
corr_unstacked.columns = ['variable1', 'variable2', 'correlation']# Rename the columns for better understanding
corr_unstacked = corr_unstacked[corr_unstacked['variable1'] != corr_unstacked['variable2']] # Remove self-correlations by filtering out rows where variable1 == variable2
corr_unstacked = corr_unstacked.drop_duplicates(subset=['correlation']) # Drop duplicates to keep only one entry per variable pair
sorted_corr = corr_unstacked.sort_values(by='correlation', key=abs, ascending=False) # Sort the DataFrame by the absolute value of correlation
sorted_corr # Display the sorted correlation table
Out[420]:
variable1 variable2 correlation
163 real_state_apartment real_state_flat -0.722653
49 bathroom square_meters 0.706826
5 price square_meters 0.693492
27 rooms square_meters 0.681368
2 price bathroom 0.596550
... ... ... ...
261 real_state_unknown neighborhood_Horta- Guinardo 0.000627
238 real_state_study neighborhood_Sants-Montjuïc -0.000458
260 real_state_unknown neighborhood_Sants-Montjuïc -0.000432
217 real_state_flat neighborhood_Horta- Guinardo 0.000268
210 real_state_flat neighborhood_Sarria-Sant Gervasi 0.000187

231 rows × 3 columns

In [421]:
# Apply the function to create the corr_lvl column
sorted_corr['corr_lvl'] = sorted_corr['correlation'].apply(categorize_correlation)
sorted_corr['corr_lvl'].value_counts()
Out[421]:
corr_lvl
Negligible    217
Low             8
Moderate        4
High            2
Name: count, dtype: int64
In [422]:
sorted_corr
Out[422]:
variable1 variable2 correlation corr_lvl
163 real_state_apartment real_state_flat -0.722653 High
49 bathroom square_meters 0.706826 High
5 price square_meters 0.693492 Moderate
27 rooms square_meters 0.681368 Moderate
2 price bathroom 0.596550 Moderate
... ... ... ... ...
261 real_state_unknown neighborhood_Horta- Guinardo 0.000627 Negligible
238 real_state_study neighborhood_Sants-Montjuïc -0.000458 Negligible
260 real_state_unknown neighborhood_Sants-Montjuïc -0.000432 Negligible
217 real_state_flat neighborhood_Horta- Guinardo 0.000268 Negligible
210 real_state_flat neighborhood_Sarria-Sant Gervasi 0.000187 Negligible

231 rows × 4 columns

  • Low correlation between variables, with only a couple of variables having a high correlation (bathroom and square_meters)

Consolidated notes on EDA (pre-modeling)¶

  • The shape of the data for modeling (7742, 22) does not account for outliers
  • The data for modeling have no missing values
  • No outliers in the data
  • Low correlation between variables, with only a couple of variables having a high correlation (bathroom and square_meters)

Modeling¶

Modeling Functions¶

In [423]:
# Define a function to evaluate and return the model's metrics
def evaluate_model(model, x_test, y_test):
    y_pred = model.predict(x_test)
    metrics = {
        "MAE": mean_absolute_error(y_test, y_pred),
        "MSE": mean_squared_error(y_test, y_pred),
        "RMSE": np.sqrt(mean_squared_error(y_test, y_pred)),
        "R2 Score": r2_score(y_test, y_pred)
    }
    return metrics
In [424]:
def models_regression_performance_cv_fit(models, x_train, y_train, x_test, y_test, res_df=None):
    """
    Function to compute different metrics to check regression model performance
    models: list of models to evaluate
    x_train, y_train: training set predictors and target
    x_test, y_test: validation set predictors and target
    res_df: dataframe to store performance metrics
    
    After fitting the model on the training data, the model is directly evaluated on the validation set.
    """
    
    # If results_df is not provided, create an empty dataframe
    if res_df is None:
        res_df = pd.DataFrame(columns=["Model", "MAE", "MSE", "RMSE", "R2 Score"])
        
    # Store new rows in a list
    results_list = []
    
    # Perform cross-validation on the training set
    for name, model in models:
        # K-Fold cross-validation
        kfold = KFold(n_splits=5, shuffle=True, random_state=1)
        
        # Cross-validation scores
        cv_mse = -cross_val_score(model, x_train, y_train, scoring="neg_mean_squared_error", cv=kfold)
        cv_mae = -cross_val_score(model, x_train, y_train, scoring="neg_mean_absolute_error", cv=kfold)
        cv_r2 = cross_val_score(model, x_train, y_train, scoring="r2", cv=kfold)
        
        # Calculate mean values for each metric
        train_mse = cv_mse.mean()
        train_mae = cv_mae.mean()
        train_r2 = cv_r2.mean()

        # Add the new row with the provided metrics
        results_list.append({
            "Model": f"{name}_train", 
            "MAE": train_mae,
            "MSE": train_mse, 
            "RMSE": np.sqrt(train_mse), 
            "R2 Score": train_r2, 
        })

        # Fit model and predict on test data
        model.fit(x_train, y_train)
        y_pred = model.predict(x_test)
        
        # Calculate test metrics
        test_mae = mean_absolute_error(y_test, y_pred)
        test_mse = mean_squared_error(y_test, y_pred)
        test_r2 = r2_score(y_test, y_pred)

        # Add the new row with the provided metrics
        results_list.append({
            "Model": f"{name}_val", 
            "MAE": test_mae,
            "MSE": test_mse, 
            "RMSE": np.sqrt(test_mse), 
            "R2 Score": test_r2, 
        })
        
    # Concatenate the list of results to results_df
    res_df = pd.concat([res_df, pd.DataFrame(results_list)], ignore_index=True)
    
    return res_df
  • Defined function "evaluate_model(model, x_test, y_test)", to evaluate and return the model's metrics into a results dataframe
  • Defined function "models_regression_performance_cv_fit(models, x_train, y_train, x_test, y_test, res_df=None)", to compute different metrics to check regression model performance

Preparing data for modeling¶

In [425]:
data=df5.copy()
  • Data preparation for modeling over a copy named "data"
In [426]:
# specifying the independent  and dependent variables
X = data.drop(["price"], axis=1)
y = data["price"]

# Checking independant and dependant shapes
print("Shape of independant variables : ", X.shape)
print("Shape of target variable : ", y.shape)
Shape of independant variables :  (7742, 21)
Shape of target variable :  (7742,)
In [427]:
X.head()
Out[427]:
rooms bathroom lift terrace square_meters square_meters_price real_state_apartment real_state_attic real_state_flat real_state_study ... neighborhood_Sarria-Sant Gervasi neighborhood_Les Corts neighborhood_Eixample neighborhood_Sant Martí neighborhood_Ciutat Vella neighborhood_Gràcia neighborhood_Sants-Montjuïc neighborhood_Horta- Guinardo neighborhood_Sant Andreu neighborhood_Nou Barris
0 3.0 1 1 0 60 12.500000 0 0 1 0 ... 0 0 0 0 0 0 0 1 0 0
1 2.0 1 1 0 59 13.050847 0 0 1 0 ... 0 0 0 0 0 0 0 0 1 0
2 1.0 1 1 1 30 28.384073 0 0 1 0 ... 0 0 0 0 0 1 0 0 0 0
3 1.0 1 1 1 70 28.384073 0 0 1 0 ... 0 0 0 0 1 0 0 0 0 0
4 2.0 1 1 0 44 16.363636 0 0 1 0 ... 0 0 0 0 0 0 0 0 1 0

5 rows × 21 columns

In [428]:
# Apply Scaling to Input
input_scaler = MinMaxScaler()
X = pd.DataFrame(input_scaler.fit_transform(X), columns=X.columns)
In [429]:
# Apply Scaling to Output (target variable)
output_scaler = MinMaxScaler()
y = pd.DataFrame(output_scaler.fit_transform(y.values.reshape(-1, 1)), columns=["price"])
  • Min-Max scaling was applied. The dataset has features with different scales, normalization ensures that no feature dominates the learning process.
In [430]:
# adding a constant to the independent variables
X = sm.add_constant(X)
In [431]:
X.head()
Out[431]:
const rooms bathroom lift terrace square_meters square_meters_price real_state_apartment real_state_attic real_state_flat ... neighborhood_Sarria-Sant Gervasi neighborhood_Les Corts neighborhood_Eixample neighborhood_Sant Martí neighborhood_Ciutat Vella neighborhood_Gràcia neighborhood_Sants-Montjuïc neighborhood_Horta- Guinardo neighborhood_Sant Andreu neighborhood_Nou Barris
0 1.0 0.625 0.0 1.0 0.0 0.370370 0.304200 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
1 1.0 0.375 0.0 1.0 0.0 0.362963 0.328330 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
2 1.0 0.125 0.0 1.0 1.0 0.148148 1.000000 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
3 1.0 0.125 0.0 1.0 1.0 0.444444 1.000000 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
4 1.0 0.375 0.0 1.0 0.0 0.251852 0.473446 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0

5 rows × 22 columns

In [432]:
X['const'].value_counts()
Out[432]:
const
1.0    7742
Name: count, dtype: int64
  • Added constant for modeling
In [433]:
# splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)


# Checking training and test sets.
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Shape of Training predictions : ", y_train.shape)
print("Shape of test predictions : ", y_test.shape)
Shape of Training set :  (5419, 22)
Shape of test set :  (2323, 22)
Shape of Training predictions :  (5419, 1)
Shape of test predictions :  (2323, 1)
In [434]:
X_train.head()
Out[434]:
const rooms bathroom lift terrace square_meters square_meters_price real_state_apartment real_state_attic real_state_flat ... neighborhood_Sarria-Sant Gervasi neighborhood_Les Corts neighborhood_Eixample neighborhood_Sant Martí neighborhood_Ciutat Vella neighborhood_Gràcia neighborhood_Sants-Montjuïc neighborhood_Horta- Guinardo neighborhood_Sant Andreu neighborhood_Nou Barris
280 1.0 1.000 1.0 1.0 0.0 1.000000 0.501322 0.0 0.0 1.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1026 1.0 0.875 1.0 1.0 0.0 1.000000 0.851761 0.0 0.0 1.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
557 1.0 0.125 0.0 1.0 0.0 0.259259 0.632737 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
7525 1.0 0.875 0.0 0.0 0.0 0.481481 0.369908 0.0 0.0 1.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1255 1.0 0.875 0.5 1.0 1.0 0.666667 0.361147 0.0 0.0 1.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 22 columns

  • Data split 70/30. Shape of Training set : (5419, 22), Shape of test set : (2323, 22)

Model Building¶

In [435]:
# Initialize an empty DataFrame to store results
results_df = pd.DataFrame(columns=["Model", "MAE", "MSE", "RMSE", "R2 Score"])
  • Performance Metrics:
    • MAE (Mean Absolute Error): Measures the average magnitude of errors in a set of predictions, without considering their direction.
    • MSE (Mean Squared Error): Measures the average of the squares of the errors, giving more weight to larger errors.
    • RMSE (Root Mean Squared Error): The square root of MSE, providing error in the same units as the target variable.
    • R2 Score (Coefficient of Determination): Indicates how well the model's predictions approximate the real data points. A value closer to 1 indicates a better fit.
In [436]:
# Dictionary of regression models to try
regression_models = {
    "Linear Regression": LinearRegression(),
    "Lasso Regression": Lasso(),
    "Ridge Regression": Ridge(),
    "Decision Tree": DecisionTreeRegressor(),
    "Random Forest": RandomForestRegressor(),
    "K-Nearest Neighbors": KNeighborsRegressor(),
    "Support Vector Regressor": SVR()}
  • Evaluated the following regression models:
    • Linear Regression
    • Lasso Regression
    • Ridge Regression
    • Decision Tree
    • Random Forest
    • K-Nearest Neighbors
    • Support Vector Regressor
In [437]:
# Loop through each model, train it, evaluate it, and store results
for model_name, model in regression_models.items():
    model.fit(X_train, y_train)
    metrics = evaluate_model(model, X_test, y_test)
    metrics["Model"] = model_name  # Add model name for reference
    results_df = pd.concat([results_df, pd.DataFrame([metrics])], ignore_index=True)
In [438]:
# Display the results DataFrame
results_df.sort_values(by="R2 Score", ascending=False)
Out[438]:
Model MAE MSE RMSE R2 Score
4 Random Forest 0.007321 0.000585 0.024185 0.990246
3 Decision Tree 0.010219 0.000944 0.030730 0.984253
6 Support Vector Regressor 0.043635 0.003353 0.057902 0.944096
0 Linear Regression 0.042146 0.004032 0.063495 0.932773
2 Ridge Regression 0.042285 0.004047 0.063619 0.932510
5 K-Nearest Neighbors 0.059559 0.009189 0.095859 0.846773
1 Lasso Regression 0.196899 0.059972 0.244891 -0.000026
In [439]:
results_df.sort_values(by="MAE")
Out[439]:
Model MAE MSE RMSE R2 Score
4 Random Forest 0.007321 0.000585 0.024185 0.990246
3 Decision Tree 0.010219 0.000944 0.030730 0.984253
0 Linear Regression 0.042146 0.004032 0.063495 0.932773
2 Ridge Regression 0.042285 0.004047 0.063619 0.932510
6 Support Vector Regressor 0.043635 0.003353 0.057902 0.944096
5 K-Nearest Neighbors 0.059559 0.009189 0.095859 0.846773
1 Lasso Regression 0.196899 0.059972 0.244891 -0.000026
In [440]:
results_df.sort_values(by="MSE")
Out[440]:
Model MAE MSE RMSE R2 Score
4 Random Forest 0.007321 0.000585 0.024185 0.990246
3 Decision Tree 0.010219 0.000944 0.030730 0.984253
6 Support Vector Regressor 0.043635 0.003353 0.057902 0.944096
0 Linear Regression 0.042146 0.004032 0.063495 0.932773
2 Ridge Regression 0.042285 0.004047 0.063619 0.932510
5 K-Nearest Neighbors 0.059559 0.009189 0.095859 0.846773
1 Lasso Regression 0.196899 0.059972 0.244891 -0.000026
In [441]:
results_df.sort_values(by="RMSE")
Out[441]:
Model MAE MSE RMSE R2 Score
4 Random Forest 0.007321 0.000585 0.024185 0.990246
3 Decision Tree 0.010219 0.000944 0.030730 0.984253
6 Support Vector Regressor 0.043635 0.003353 0.057902 0.944096
0 Linear Regression 0.042146 0.004032 0.063495 0.932773
2 Ridge Regression 0.042285 0.004047 0.063619 0.932510
5 K-Nearest Neighbors 0.059559 0.009189 0.095859 0.846773
1 Lasso Regression 0.196899 0.059972 0.244891 -0.000026
  • Random Forest: Best Performance. It has the lowest MAE (0.007321), MSE (0.000585), and RMSE (0.024185), and the highest R2 Score (0.990246), indicating it is the most accurate model among the ones tested.
  • Decision Tree: Second Best. It also performs very well with low MAE, MSE, and RMSE, and a high R2 Score.
  • Support Vector Regressor (SVR): Good Performance. It has a relatively low MAE, MSE, and RMSE, with a high R2 Score.
  • Linear Regression and Ridge Regression: Similar Performance. Both have similar values for all metrics indicating decent performance.
  • K-Nearest Neighbors (KNN): Moderate Performance. It has higher MAE, MSE, and RMSE, with a lower R2 Score, indicating it is less accurate compared to the top models.
  • Lasso Regression: Poor Performance: It has the worst (highest) MAE, MSE, and RMSE, with a negative R2 Score, indicating it performs poorly on this dataset.

Model Tuning¶

In [442]:
%%time
# Define the model
rf = RandomForestRegressor()

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)

# Fit the model
grid_search.fit(X_train, y_train)

# Get the best parameters
best_params = grid_search.best_params_
print("Best parameters for Random Forest:", best_params)
Best parameters for Random Forest: {'max_depth': 20, 'min_samples_split': 5, 'n_estimators': 300}
CPU times: total: 5.67 s
Wall time: 3min 47s
In [443]:
%%time
# Define the model
dt = DecisionTreeRegressor()

# Define the parameter distribution
param_dist = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'criterion': ['mse', 'friedman_mse', 'mae']
}

# Initialize RandomizedSearchCV
random_search = RandomizedSearchCV(estimator=dt, param_distributions=param_dist, n_iter=100, cv=5, scoring='neg_mean_squared_error', n_jobs=-1, random_state=42)

# Fit the model
random_search.fit(X_train, y_train)

# Get the best parameters
best_params = random_search.best_params_
print("Best parameters for Decision Tree:", best_params)
Best parameters for Decision Tree: {'min_samples_split': 5, 'max_depth': 20, 'criterion': 'friedman_mse'}
CPU times: total: 391 ms
Wall time: 1.14 s
In [444]:
%%time
# Define the model
svr = SVR()

# Define the parameter grid
param_grid = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'gamma': ['scale', 'auto']
}

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=svr, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)

# Fit the model
grid_search.fit(X_train, y_train)

# Get the best parameters
best_params = grid_search.best_params_
print("Best parameters for SVR:", best_params)
Best parameters for SVR: {'C': 100, 'gamma': 'scale', 'kernel': 'rbf'}
CPU times: total: 1.03 s
Wall time: 1min 20s
In [445]:
# Dictionary of regression models tuned
regression_models_tuned = {
    "Tuned Random Forest": RandomForestRegressor(max_depth= 20, min_samples_split= 5, n_estimators=300),
    "Tuned Decision Tree": DecisionTreeRegressor(min_samples_split= 5, max_depth= 20, criterion='friedman_mse'),
    "Tuned Support Vector Regressor": SVR(C= 100, gamma='scale', kernel='rbf')}
  • The three best performing models were selected for model tuning: Random Forest, Decision Tree, and Support Vector Regression.
In [446]:
# Loop through each model, train it, evaluate it, and store results
for model_name, model in regression_models_tuned.items():
    model.fit(X_train, y_train)
    metrics = evaluate_model(model, X_test, y_test)
    metrics["Model"] = model_name  # Add model name for reference
    results_df = pd.concat([results_df, pd.DataFrame([metrics])], ignore_index=True)
In [447]:
# Display the results DataFrame
results_df.sort_values(by="R2 Score", ascending=False)
Out[447]:
Model MAE MSE RMSE R2 Score
7 Tuned Random Forest 0.007344 0.000561 0.023676 0.990653
4 Random Forest 0.007321 0.000585 0.024185 0.990246
8 Tuned Decision Tree 0.010478 0.000859 0.029309 0.985676
3 Decision Tree 0.010219 0.000944 0.030730 0.984253
9 Tuned Support Vector Regressor 0.040213 0.002837 0.053267 0.952687
6 Support Vector Regressor 0.043635 0.003353 0.057902 0.944096
0 Linear Regression 0.042146 0.004032 0.063495 0.932773
2 Ridge Regression 0.042285 0.004047 0.063619 0.932510
5 K-Nearest Neighbors 0.059559 0.009189 0.095859 0.846773
1 Lasso Regression 0.196899 0.059972 0.244891 -0.000026
In [448]:
results_df.sort_values(by="MAE")
Out[448]:
Model MAE MSE RMSE R2 Score
4 Random Forest 0.007321 0.000585 0.024185 0.990246
7 Tuned Random Forest 0.007344 0.000561 0.023676 0.990653
3 Decision Tree 0.010219 0.000944 0.030730 0.984253
8 Tuned Decision Tree 0.010478 0.000859 0.029309 0.985676
9 Tuned Support Vector Regressor 0.040213 0.002837 0.053267 0.952687
0 Linear Regression 0.042146 0.004032 0.063495 0.932773
2 Ridge Regression 0.042285 0.004047 0.063619 0.932510
6 Support Vector Regressor 0.043635 0.003353 0.057902 0.944096
5 K-Nearest Neighbors 0.059559 0.009189 0.095859 0.846773
1 Lasso Regression 0.196899 0.059972 0.244891 -0.000026
In [449]:
results_df.sort_values(by="MSE")
Out[449]:
Model MAE MSE RMSE R2 Score
7 Tuned Random Forest 0.007344 0.000561 0.023676 0.990653
4 Random Forest 0.007321 0.000585 0.024185 0.990246
8 Tuned Decision Tree 0.010478 0.000859 0.029309 0.985676
3 Decision Tree 0.010219 0.000944 0.030730 0.984253
9 Tuned Support Vector Regressor 0.040213 0.002837 0.053267 0.952687
6 Support Vector Regressor 0.043635 0.003353 0.057902 0.944096
0 Linear Regression 0.042146 0.004032 0.063495 0.932773
2 Ridge Regression 0.042285 0.004047 0.063619 0.932510
5 K-Nearest Neighbors 0.059559 0.009189 0.095859 0.846773
1 Lasso Regression 0.196899 0.059972 0.244891 -0.000026
In [450]:
results_df.sort_values(by="RMSE")
Out[450]:
Model MAE MSE RMSE R2 Score
7 Tuned Random Forest 0.007344 0.000561 0.023676 0.990653
4 Random Forest 0.007321 0.000585 0.024185 0.990246
8 Tuned Decision Tree 0.010478 0.000859 0.029309 0.985676
3 Decision Tree 0.010219 0.000944 0.030730 0.984253
9 Tuned Support Vector Regressor 0.040213 0.002837 0.053267 0.952687
6 Support Vector Regressor 0.043635 0.003353 0.057902 0.944096
0 Linear Regression 0.042146 0.004032 0.063495 0.932773
2 Ridge Regression 0.042285 0.004047 0.063619 0.932510
5 K-Nearest Neighbors 0.059559 0.009189 0.095859 0.846773
1 Lasso Regression 0.196899 0.059972 0.244891 -0.000026
  • After hyperparameter tuning, Tuned Random Forest, Random Forest, and Tuned Decision Tree are the best performing models
  • The 3 best performing models after hyperparameter tuning will be evaluated with CrossValidation

Model Validation¶

In [451]:
# Splitting data into training, validation and test set (60-20-20)
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=1)
print(X_train.shape, X_val.shape, X_test.shape)
(4644, 22) (1549, 22) (1549, 22)
  • 60-20-20 data split for training-validation-testing
In [455]:
# List of models to evaluate
top3models = [
    ("Tuned Random Forest_cv", RandomForestRegressor(max_depth= 20, min_samples_split= 5, n_estimators=300)),
    ("Random Forest_cv", RandomForestRegressor()),
    ("Tuned Decision Tree_cv", DecisionTreeRegressor(min_samples_split= 5, max_depth= 20, criterion='friedman_mse'))
    ]
In [456]:
res_df = pd.DataFrame(columns=["Model", "MAE", "MSE", "RMSE", "R2 Score"])
In [457]:
top3models= models_regression_performance_cv_fit(top3models, X_train, y_train, X_val, y_val,res_df)
In [458]:
top3models.sort_values(by="R2 Score", ascending=False)
Out[458]:
Model MAE MSE RMSE R2 Score
0 Tuned Random Forest_cv_train 0.008073 0.000570 0.023871 0.990497
2 Random Forest_cv_train 0.008088 0.000589 0.024270 0.990342
1 Tuned Random Forest_cv_val 0.008124 0.000651 0.025522 0.989572
3 Random Forest_cv_val 0.008107 0.000652 0.025542 0.989556
4 Tuned Decision Tree_cv_train 0.012006 0.000972 0.031180 0.983627
5 Tuned Decision Tree_cv_val 0.011874 0.001130 0.033613 0.981912
  • Tuned Random Forest is the best performing model among the cross-validated models
In [459]:
final_results_df = pd.concat([results_df, top3models], ignore_index=True)
final_results_df
Out[459]:
Model MAE MSE RMSE R2 Score
0 Linear Regression 0.042146 0.004032 0.063495 0.932773
1 Lasso Regression 0.196899 0.059972 0.244891 -0.000026
2 Ridge Regression 0.042285 0.004047 0.063619 0.932510
3 Decision Tree 0.010219 0.000944 0.030730 0.984253
4 Random Forest 0.007321 0.000585 0.024185 0.990246
5 K-Nearest Neighbors 0.059559 0.009189 0.095859 0.846773
6 Support Vector Regressor 0.043635 0.003353 0.057902 0.944096
7 Tuned Random Forest 0.007344 0.000561 0.023676 0.990653
8 Tuned Decision Tree 0.010478 0.000859 0.029309 0.985676
9 Tuned Support Vector Regressor 0.040213 0.002837 0.053267 0.952687
10 Tuned Random Forest_cv_train 0.008073 0.000570 0.023871 0.990497
11 Tuned Random Forest_cv_val 0.008124 0.000651 0.025522 0.989572
12 Random Forest_cv_train 0.008088 0.000589 0.024270 0.990342
13 Random Forest_cv_val 0.008107 0.000652 0.025542 0.989556
14 Tuned Decision Tree_cv_train 0.012006 0.000972 0.031180 0.983627
15 Tuned Decision Tree_cv_val 0.011874 0.001130 0.033613 0.981912
In [460]:
# Display the results DataFrame
final_results_df.sort_values(by="R2 Score", ascending=False)
Out[460]:
Model MAE MSE RMSE R2 Score
7 Tuned Random Forest 0.007344 0.000561 0.023676 0.990653
10 Tuned Random Forest_cv_train 0.008073 0.000570 0.023871 0.990497
12 Random Forest_cv_train 0.008088 0.000589 0.024270 0.990342
4 Random Forest 0.007321 0.000585 0.024185 0.990246
11 Tuned Random Forest_cv_val 0.008124 0.000651 0.025522 0.989572
13 Random Forest_cv_val 0.008107 0.000652 0.025542 0.989556
8 Tuned Decision Tree 0.010478 0.000859 0.029309 0.985676
3 Decision Tree 0.010219 0.000944 0.030730 0.984253
14 Tuned Decision Tree_cv_train 0.012006 0.000972 0.031180 0.983627
15 Tuned Decision Tree_cv_val 0.011874 0.001130 0.033613 0.981912
9 Tuned Support Vector Regressor 0.040213 0.002837 0.053267 0.952687
6 Support Vector Regressor 0.043635 0.003353 0.057902 0.944096
0 Linear Regression 0.042146 0.004032 0.063495 0.932773
2 Ridge Regression 0.042285 0.004047 0.063619 0.932510
5 K-Nearest Neighbors 0.059559 0.009189 0.095859 0.846773
1 Lasso Regression 0.196899 0.059972 0.244891 -0.000026
In [461]:
final_results_df.sort_values(by="MAE")
Out[461]:
Model MAE MSE RMSE R2 Score
4 Random Forest 0.007321 0.000585 0.024185 0.990246
7 Tuned Random Forest 0.007344 0.000561 0.023676 0.990653
10 Tuned Random Forest_cv_train 0.008073 0.000570 0.023871 0.990497
12 Random Forest_cv_train 0.008088 0.000589 0.024270 0.990342
13 Random Forest_cv_val 0.008107 0.000652 0.025542 0.989556
11 Tuned Random Forest_cv_val 0.008124 0.000651 0.025522 0.989572
3 Decision Tree 0.010219 0.000944 0.030730 0.984253
8 Tuned Decision Tree 0.010478 0.000859 0.029309 0.985676
15 Tuned Decision Tree_cv_val 0.011874 0.001130 0.033613 0.981912
14 Tuned Decision Tree_cv_train 0.012006 0.000972 0.031180 0.983627
9 Tuned Support Vector Regressor 0.040213 0.002837 0.053267 0.952687
0 Linear Regression 0.042146 0.004032 0.063495 0.932773
2 Ridge Regression 0.042285 0.004047 0.063619 0.932510
6 Support Vector Regressor 0.043635 0.003353 0.057902 0.944096
5 K-Nearest Neighbors 0.059559 0.009189 0.095859 0.846773
1 Lasso Regression 0.196899 0.059972 0.244891 -0.000026
In [462]:
final_results_df.sort_values(by="MSE")
Out[462]:
Model MAE MSE RMSE R2 Score
7 Tuned Random Forest 0.007344 0.000561 0.023676 0.990653
10 Tuned Random Forest_cv_train 0.008073 0.000570 0.023871 0.990497
4 Random Forest 0.007321 0.000585 0.024185 0.990246
12 Random Forest_cv_train 0.008088 0.000589 0.024270 0.990342
11 Tuned Random Forest_cv_val 0.008124 0.000651 0.025522 0.989572
13 Random Forest_cv_val 0.008107 0.000652 0.025542 0.989556
8 Tuned Decision Tree 0.010478 0.000859 0.029309 0.985676
3 Decision Tree 0.010219 0.000944 0.030730 0.984253
14 Tuned Decision Tree_cv_train 0.012006 0.000972 0.031180 0.983627
15 Tuned Decision Tree_cv_val 0.011874 0.001130 0.033613 0.981912
9 Tuned Support Vector Regressor 0.040213 0.002837 0.053267 0.952687
6 Support Vector Regressor 0.043635 0.003353 0.057902 0.944096
0 Linear Regression 0.042146 0.004032 0.063495 0.932773
2 Ridge Regression 0.042285 0.004047 0.063619 0.932510
5 K-Nearest Neighbors 0.059559 0.009189 0.095859 0.846773
1 Lasso Regression 0.196899 0.059972 0.244891 -0.000026
In [463]:
final_results_df.sort_values(by="RMSE")
Out[463]:
Model MAE MSE RMSE R2 Score
7 Tuned Random Forest 0.007344 0.000561 0.023676 0.990653
10 Tuned Random Forest_cv_train 0.008073 0.000570 0.023871 0.990497
4 Random Forest 0.007321 0.000585 0.024185 0.990246
12 Random Forest_cv_train 0.008088 0.000589 0.024270 0.990342
11 Tuned Random Forest_cv_val 0.008124 0.000651 0.025522 0.989572
13 Random Forest_cv_val 0.008107 0.000652 0.025542 0.989556
8 Tuned Decision Tree 0.010478 0.000859 0.029309 0.985676
3 Decision Tree 0.010219 0.000944 0.030730 0.984253
14 Tuned Decision Tree_cv_train 0.012006 0.000972 0.031180 0.983627
15 Tuned Decision Tree_cv_val 0.011874 0.001130 0.033613 0.981912
9 Tuned Support Vector Regressor 0.040213 0.002837 0.053267 0.952687
6 Support Vector Regressor 0.043635 0.003353 0.057902 0.944096
0 Linear Regression 0.042146 0.004032 0.063495 0.932773
2 Ridge Regression 0.042285 0.004047 0.063619 0.932510
5 K-Nearest Neighbors 0.059559 0.009189 0.095859 0.846773
1 Lasso Regression 0.196899 0.059972 0.244891 -0.000026
  • From all models evaluated, Tuned Random Forest turns out to be the best performing model

Consolidated notes on Modeling¶

  • Defined function "evaluate_model(model, x_test, y_test)", to evaluate and return the model's metrics into a results dataframe
  • Defined function "models_regression_performance_cv_fit(models, x_train, y_train, x_test, y_test, res_df=None)", to compute different metrics to check regression model performance
  • Data preparation for modeling over a copy named "data"
  • Min-Max scaling was applied. The dataset has features with different scales, normalization ensures that no feature dominates the learning process.
  • Added constant for modeling
  • Data split 70/30. Shape of Training set : (5419, 22), Shape of test set : (2323, 22)
  • Performance Metrics:
    • MAE (Mean Absolute Error): Measures the average magnitude of errors in a set of predictions, without considering their direction.
    • MSE (Mean Squared Error): Measures the average of the squares of the errors, giving more weight to larger errors.
    • RMSE (Root Mean Squared Error): The square root of MSE, providing error in the same units as the target variable.
    • R2 Score (Coefficient of Determination): Indicates how well the model's predictions approximate the real data points. A value closer to 1 indicates a better fit.
  • Evaluated the following regression models:
    • Linear Regression
    • Lasso Regression
    • Ridge Regression
    • Decision Tree
    • Random Forest
    • K-Nearest Neighbors
    • Support Vector Regressor
  • Random Forest: Best Performance. It has the lowest MAE (0.007321), MSE (0.000585), and RMSE (0.024185), and the highest R2 Score (0.990246), indicating it is the most accurate model among the ones tested.
  • Decision Tree: Second Best. It also performs very well with low MAE, MSE, and RMSE, and a high R2 Score.
  • Support Vector Regressor (SVR): Good Performance. It has a relatively low MAE, MSE, and RMSE, with a high R2 Score.
  • Linear Regression and Ridge Regression: Similar Performance. Both have similar values for all metrics indicating decent performance.
  • K-Nearest Neighbors (KNN): Moderate Performance. It has higher MAE, MSE, and RMSE, with a lower R2 Score, indicating it is less accurate compared to the top models.
  • Lasso Regression: Poor Performance: It has the worst (highest) MAE, MSE, and RMSE, with a negative R2 Score, indicating it performs poorly on this dataset.
  • The three best performing models were selected for model tuning: Random Forest, Decision Tree, and Support Vector Regression.
  • After hyperparameter tuning, Tuned Random Forest, Random Forest, and Tuned Decision Tree are the best performing models
  • The 3 best performing models after hyperparameter tuning will be evaluated with CrossValidation
  • 60-20-20 data split for training-validation-testing
  • Tuned Random Forest is the best performing model among the cross-validated models
  • From all models evaluated, Tuned Random Forest turns out to be the best performing model

Final Model¶

In [464]:
# Define the model with the selected hyperparameters
RandomForestTuned = RandomForestRegressor(max_depth= 20, min_samples_split= 5, n_estimators=300)

# Train the model on the entire training dataset
RandomForestTuned.fit(X_train, y_train)

# Evaluate the model on the test set
y_pred = RandomForestTuned.predict(X_test)
In [465]:
evaluate_model(RandomForestTuned, X_test, y_test)
Out[465]:
{'MAE': np.float64(0.007490556570714266),
 'MSE': np.float64(0.0005540728556713073),
 'RMSE': np.float64(0.023538752211434387),
 'R2 Score': 0.9905977292708693}
In [466]:
# Feature importance
feature_importances = pd.Series(RandomForestTuned.feature_importances_, index=X_train.columns)
feature_importances = feature_importances.sort_values(ascending=False)

# Plotting
plt.figure(figsize=(10, 6))
feature_importances.plot(kind='bar')
plt.title('Feature Importance')
plt.xlabel('Features')
plt.ylabel('Importance Score')
plt.show()
No description has been provided for this image
In [467]:
# Crear el gráfico de dispersión con la línea de regresión
plt.figure(figsize=(8,6))
plt.scatter(y_test, y_pred, color='blue')
plt.show()
No description has been provided for this image

Conclusions¶

  • square_meters is the most important variable to predict unit price and at least an aproximated value should by provided as input.
  • Having square_meters_price as the second most important variable, and considering it might be an unknown input, the model could consider mean by to inpute.
  • The most accurate model identified to predict price is: RandomForestRegressor(max_depth= 20, min_samples_split= 5, n_estimators=300)
  • With this model the performance achieved is MAE:0.00749, MSE:0.000554, RMSE:0.023538, and R2 Score: 0.990597
  • Results obtained with scaled data (Min-Max scaling)

Deploy¶

In [470]:
# Save model and scaler
joblib.dump(RandomForestTuned, "Random_Forest_Tuned.pkl")
Out[470]:
['Random_Forest_Tuned.pkl']
In [471]:
joblib.dump(input_scaler, "Input_scaler.pkl")
Out[471]:
['Input_scaler.pkl']
In [472]:
joblib.dump(output_scaler, "Output_scaler.pkl")
Out[472]:
['Output_scaler.pkl']
  • Using the joblib library in Python to save Python objects like machine learning models, scalers, and other data-processing objects to disk.
  • It allows to efficiently store these objects and later reload them for future use, avoiding the need to retrain models or reinitialize objects.
  • joblib is commonly used for saving trained models because it’s optimized for handling large amounts of data, especially with NumPy arrays or objects from scikit-learn.
In [473]:
!jupyter nbconvert --to html PROJECT1_TEAM2_CM.ipynb
C:\Users\otroc\AppData\Local\Programs\Python\Python313\Scripts\jupyter-nbconvert.EXE\__main__.py:4: DeprecationWarning: Parsing dates involving a day of month without a year specified is ambiguious
and fails to parse leap day. The default behavior will change in Python 3.15
to either always raise an exception or to use a different default year (TBD).
To avoid trouble, add a specific year to the input & format.
See https://github.com/python/cpython/issues/70647.
[NbConvertApp] Converting notebook PROJECT1_TEAM2_CM.ipynb to html
[NbConvertApp] WARNING | Alternative text is missing on 19 image(s).
[NbConvertApp] Writing 4377591 bytes to PROJECT1_TEAM2_CM.html